
The Libraries

In [1]:
#importing built-in libraries
import re
from collections import Counter
from math import log2

#importing numpy and pandas for data manipulation
import numpy as np
import pandas as pd

#importing plotly and cufflinks for creating visualizations
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode
import plotly.io as pio
import cufflinks as cf
cf.go_offline()

# setting default template to plotly_dark for all visualizations
pio.templates.default = "plotly_dark"

# for charts to be rendered properly
init_notebook_mode()

#for saving the images (commented out for the report)
#import os
#import kaleido

Hypotheses

The Data

Time to spill some beans.


In [2]:
#reading the data
office = pd.read_csv('The-Office-Lines-V4.csv', encoding='latin-1')

#dropping the leftover index column
office = office.drop('Unnamed: 6', axis=1)

Understanding the data

The 'Office' dataset contains all the dialogues in the show, along with the name of the speaker and some other information.

In [3]:
office.head()
Out[3]:
season episode title scene speaker line
0 1 1 Pilot 1 Michael All right Jim. Your quarterlies look very good...
1 1 1 Pilot 1 Jim Oh, I told you. I couldn't close it. So...
2 1 1 Pilot 1 Michael So you've come to the master for guidance? Is ...
3 1 1 Pilot 1 Jim Actually, you called me in here, but yeah.
4 1 1 Pilot 1 Michael All right. Well, let me show you how it's done.

season : season number

episode : episode number

title : episode title

scene : scene number

speaker : speaker in the scene

line : lines of the speaker

Before going any further, let's check if there are any missing values in the dataset.

In [4]:
#checking for missing values
office.isnull().sum()
Out[4]:
season     0
episode    0
title      0
scene      0
speaker    0
line       0
dtype: int64

No NAs or NULLs - great success! Now, let's see how many speakers there are in the series.

In [5]:
print(office['speaker'].unique())
office['speaker'].unique().shape
['Michael' 'Jim' 'Pam' 'Dwight' 'Jan' 'Michel' 'Todd Packer' 'Phyllis'
 'Stanley' 'Oscar' 'Angela' 'Kevin' 'Ryan' 'Man' 'Roy' 'Mr. Brown' 'Toby'
 'Kelly' 'Meredith' 'Travel Agent' 'Man on Phone' 'Everybody' 'Lonny'
 'Darryl' 'Teammates' 'Michael and Dwight' 'Warehouse worker' 'Madge'
 'Worker' 'Katy' 'Guy at bar' 'Other Guy at Bar' 'Guy At Bar'
 'Pam and Jim' 'Employee' "Chili's Employee" 'Warehouse Guy'
 'Warehouse guy' 'Man in Video' 'Video' 'Actor' 'Redheaded Actress'
 "Mr. O'Malley" 'Albiny' "Pam's Mom" 'Carol' 'Bill' 'Everyone' 'Crowd'
 'song' 'Song' 'Dwight and Michael' 'Sherri' 'Creed' 'Devon' 'Children'
 'Kid' 'Ira' "Ryan's Voicemail" 'Christian' 'Hostess'
 'Michael and Christian' 'Sadiq (IT guy)' 'Mark' 'Improv Teacher'
 'Mary-Beth' 'Girl acting Pregnant' 'Actress' 'Michael and Jim'
 'Kevin & Oscar' 'All' 'Liquor Store Clerk' 'JIm' 'Bob Vance'
 'Phyllis, Meredith, Michael, Kevin' 'Captain Jack' 'Brenda'
 'Darryl and Katy' 'Jim and Pam' 'Billy Merchant' 'Doctor' 'Lab Tech'
 'Dana' "Hooter's Girls" 'Phylis' 'Gil' 'Pam and others' 'Ed' 'Packer'
 'Todd' "Jim's voicemail" 'Guy' 'Group chant' 'All the Men' 'Delivery man'
 'Craig' 'Josh' 'David' 'Dan' 'Overhead' 'Speaker' 'Jim and Dwight'
 'Melissa' 'Sasha' 'Abby' 'Jake' 'The Kids' 'Kids' 'Miss Trudy'
 'Edward R. Meow' 'Chet' 'Young Michael' 'Delivery Woman' 'Delivery Boy'
 'Office Staff' 'Store Employee' 'Pam/Jim' 'Linda' 'Hank'
 'I.D. Photographer' 'Photographer' 'Anglea' 'Female worker'
 "Billy's Girlfriend" 'Billy' 'Dealer' 'Bob' 'Andy' 'Karen'
 'Jerome Bettis' 'Ted' 'Waiter' 'Jim, Josh, and Dwight' 'Evan' 'Alan'
 'Ryan and others' 'Announcer' 'Pretzel guy' 'Cousin Mose' 'Tony' 'Server'
 'Girls' "Kelly's Mom" "Kelly's Father" 'Young Man' 'Andy and Jim'
 'Dwight ' 'M ichael' 'Michael ' 'Dwight:' 'Hannah' 'Martin' 'Male voice'
 'Michael & Dwight' 'Andy & Michael' 'Waitress' 'Chef' 'Woman at bar'
 'Cindy' 'Second Cindy' 'Other waitress' 'Andy and Michael' 'Both'
 'Harvey' 'Buyer' 'Kenny' 'Julius' 'Phone' 'Staples Guy' 'MIchael' 'Lady'
 'Paris' 'Marcy' 'Ben Franklin' 'Elizabeth' 'Priest' 'Uncle Al' 'Randy'
 'Unknown' 'Women' 'College Student' 'Business Student #1'
 'Business Student #2' 'Business Student #3' 'Woman' 'Artist' 'Rachel'
 'Dan Gore' 'Bartender' 'Student 1' 'Student 2' 'Child' 'Hunter' 'Darry'
 'Micheal' 'Chad Lite' 'Jamie' 'Barbara' 'School Official' 'Group'
 'Receptionist' 'IT Tech Guy' 'Nurse' 'Intern' 'Robert Dunder' 'Amy' 'GPS'
 'Larry Myers' 'Ex-client' 'Voice of Thomas Dean' 'sAndy' 'DunMiff/sys'
 'DwightKSchrute' 'Tech Guy' 'Angels' 'Pizza guy' 'Manager'
 'Voice #1 on phone' 'Voice #2 on phone' 'Micahel' 'Michae' 'Nick' 'Mose'
 'Co-Worker 1' 'Stanely' 'Micael' 'Vikram' 'Co-Worker 2' 'Co-Worker 3'
 'Mr. Figaro' 'Oscar and Stanley' 'Ad guy 1' 'Ad guy 2' 'David Wallace'
 'Andy, Creed, Kevin, Kelly, Darryl' 'Andy, Creed, Kevin, Kelly'
 "Michael's Ad" 'Rolando' 'Ben' 'Lester' 'Diane Kelly' 'Diane'
 'Deposition Reporter' 'Council' "Hunter's CD" 'Officer 1' 'Officer 2'
 'Officer' "Wendy's phone operator" 'Margaret' 'Coffee shop worker'
 'W.B. Jones' 'Paul Faust' 'Bill Cress' 'Paul' 'Michael/Dwight' 'Troy'
 'Girl in Club' 'Tall Girl #1' 'All Girls' 'Tall Girl #2'
 'Girl in 2nd club' 'Cleaning lady' 'Michael and Darryl' 'Phil Maguire'
 'Phil' 'Justin' 'Angela and Dwight' 'Maguire' 'Woman on mic'
 'Graphics guy' 'Holly' 'Woman over speakerphone'
 'Vance Refrigeration guy' 'Holy' 'Ronnie' 'Professor' 'Friend' 'JIM9334'
 'Receptionitis15' 'Michael & Holly' 'Dight' 'Kendall' 'Man on phone'
 'Hank ' 'Guy in audience' 'Michael and Holly'
 'Michael, Holly, and Darryl' 'Tom' 'Pete' 'Mother' 'Alex' 'Customer'
 'Stewardess' 'Beth' 'Concierge' 'Marie' 'Guy at table' 'Concierge Marie'
 'Client' 'Dacvid Walalce' 'David Wallcve' 'Dacvid Wallace' 'Leo'
 'Vance Refrigeration Guy' 'Police Officer 1' 'Police Officer 2'
 'Guy buying doll' 'Rehab Nurse' 'Everyone watching'
 'Entire Prince family' 'Prince Grandfather' 'Entire office' 'Jim '
 'Prince' 'Prince Granddaughter' 'Prince Grandmother' 'Prince Son'
 'Phyllis and Creed' 'Lawyer' 'CPR trainer' 'CPR Trainer' 'Rose'
 'Jessica Alba' 'Lily' 'Sam' 'Warehouse Michael' 'Julia' 'A.J.'
 'Phone Salesman' 'Jim, Pam, Michael and Dwight' 'Blood Drive Worker'
 'Blood Girl' 'Lynn' 'Blonde' 'Eric' 'Girl' 'Charles' 'Stephanie'
 'Employees' 'Isaac' 'Angela and Kelly' 'Supervisor' 'Michal' 'Nana'
 'Chares' 'Old Woman' 'Erin' 'Dwight and Erin' 'Dwight and Andy'
 'Michael, Pam & Ryan' 'Secretary' 'Automated phone voice' 'Mr. Schofield'
 'Financial Guy' 'Ty' 'Jessica' 'Vance Refrigeration Guy 1'
 'Vance Refrigeration Guy 2' 'VRG 1' 'VRG 2' 'Rolph' 'AJ'
 'Man from Buffalo' 'Woman from Buffalo' 'Dwight & Andy' 'Female Intern'
 'Female intern' 'Maurie' 'Megan' 'Gwenneth' 'Front Desk Clerk'
 'Mr. Halpert' 'Mema' 'Mr. Beesly' 'Little Girl' 'Penny' 'Isabel'
 'Hotel Employee' 'Hotel Manager' "Pam's mom" 'Tom Halpert' 'Pete Halpert'
 'Tom and Pete' "Pam's dad" 'Grotti' 'Andy and Dwight' 'Credit card rep'
 'Rep' 'Various' 'Keena Gifford' 'Helene' "David Wallace's Secretary"
 'Voice on CD player' 'Limo Driver' 'Jim & Pam' 'Laurie' 'Registrar'
 'Security' 'Woman in line' 'Man in line' 'Shareholder'
 'Female Shareholder' 'Second Shareholder' 'Third Shareholder'
 'Fourth Shareholder' "O'Keefe" 'Mikela' 'Students' 'Teacher' 'Lefevre'
 'Zion' 'Deliveryman' 'Michael and Erin' 'Daryl' 'Office' 'Kelly and Erin'
 'Matt' 'Computron' 'Fake Stanley' 'Gabe' 'Andy & Erin' 'Christian Slater'
 'Jo Bennett' 'Jo' 'Jerry' 'Teddy Wallace' 'Mrs. Wallace' 'Teddy'
 'Dwight, Jim and Michael' 'Policeman' 'Hospital employee'
 "(Pam's mom) Heleen" 'Kathy' 'Dale' 'Clark' ' Jim' 'Isabelle' 'D'
 'Warehouse guy 1' 'Warehouse guy 2' 'Reid' 'Night cleaning crew'
 'Miichael' 'Dwight: ' 'Michael: ' 'Jim: ' 'Meredith: ' 'Angela: '
 'Creed: ' 'Phyllis: ' 'Everyone: ' 'Oscar: ' 'Stanley: ' 'Matt: '
 'Warehouse Guy: ' 'Darryl: ' 'Andy: ' 'Pam: ' 'Erin: ' 'Kevin: '
 'Julie: ' 'Isabel: ' 'Hide: ' 'Ryan: ' 'Kelly: ' 'Bar Manager: '
 'Bouncer: ' 'Girl at table: ' 'Cookie Monster' 'Dwight.'
 "Hayworth's waiter" "Oscar's voice from the computer" 'Donna' 'Mihael'
 'Hide' 'Old lady' 'Glen' 'Gym Instructor' 'Gym instructor'
 'Dwight and Angela' 'Shane' 'Reporter' 'Realtor' 'Luke'
 'Window treatment guy' 'Angel' 'Salesman' 'Usher' 'Shelby' 'Sweeney Todd'
 'Son' 'Nate' 'Employees except Dwight' 'Astrid' 'Carroll' 'Carrol'
 'Danny' 'Steve' 'Darryl and Andy' 'Church congregation' 'Pastor'
 ' Pastor' 'Female church member' 'Male church member' 'Doug' 'Mee-Maw'
 'MeeMaw' 'Carla' "Jim's Dad" 'Bus driver' 'Michael and Andy'
 'Another guy' 'Radio' 'TV' 'Meridith' 'Robotic Voice' 'Ryan and Michael'
 'Phyliss' 'Dwight & Nate' 'Passer-by' 'Pam ' 'Bass Player' 'Justine'
 'Jada' 'Robert' 'Darrly' 'Member' 'Video Michael' 'Bookstore employee'
 'DJ' 'David Brent' 'Older guy' 'Phyllis, Stanley, Dwight' 'Younger Guy'
 'Older Woman' 'Professor Powell' 'Ryan and Kelly' 'Helen' 'Attendant'
 'Hot Dog Guy' 'Cell Phone Sales Person' 'Boom Box' 'Andy and Erin'
 'Delivery' 'Samuel' 'President' 'Goldenface' 'Cherokee Jack'
 'Michael and Samuel together' "Holly's Mom" "Holly's Dad" 'Deangelo'
 'Deangelo/Michael' 'Denagelo' "Darryl's sister" 'DeAngelo' '"Jo"'
 '"Angela"' '"Jim"' '"Phyllis"' 'Together' 'Audience' 'Erin and Kelly'
 'abe' 'Rory' 'DeAgnelo' 'Jordan' 'All but Oscar' ' Jo'
 'Darryl and Angela' 'Fred Henry' 'Fred' 'Warren Buffett' 'Warren'
 'Robert California' 'Merv Bronte' 'Merv' 'Nellie Bertram' 'Nellie'
 'Finger Lakes Guy' 'Pam as "fourth-biggest client"'
 'Pam as "ninth-biggest client"' 'Tattoo Artist' 'Female Applicant'
 'Male Applicant 1' 'Male Applicant 2' 'Gideon' 'Bruce'
 'Dwight, Erin, Jim & Kevin' 'Walter' 'Ellen' 'Walter Jr' 'Andy & Walter'
 'Walter & Walter Jr' "Erin's Cell Phone" 'Bert' 'Gabe/Kelly/Toby'
 'Andy/Pam' 'Andy/Stanley' 'Val' 'Warehouse Crew' 'Cathy' 'Offscreen'
 'Curtis' 'Drummer' 'Pam and Kelly' 'Old Man' 'Andy and Darryl'
 'Darryl and Kevin' 'Park Ranger' 'Chelsea' "Chelsea's Mom" 'Archivist'
 'Narrator' 'Soldier' 'Amanda' 'Susan' 'Andy/Oscar' 'Host'
 'Queerenstein Bears' "Oscar's friend" 'Stu' 'Stonewall Host'
 'Senator Lipton' 'Ernesto' 'Cece' 'Saleswoman' 'Emergency Operator'
 'Paramedic' 'Donna Muraski' 'Wally Amos' 'Angela/Pam' 'Brandon' 'Blogger'
 'Blogger 2' 'Lady Blogger' 'Patty' 'Old Lady' 'Others' 'Elderly Woman'
 'Irene' 'Alonzo' 'Glenn' 'Kevin & Meredith' 'Lauren' 'Party guests'
 'Magician' 'Ravi' 'Robert & Creed' 'Wrangler' 'Senator' 'Vet' 'Harry'
 'Mr. Ramish' 'Calvin' 'Off-camera' 'Rafe' 'Fake Jim' 'Voicemail'
 'Nellie and Pam' 'Video Andy' 'Phyllis, Kevin & Stanley' 'HCT Member #1'
 'HCT Member #2' 'Broccoli Rob' 'Businessman #1' 'Businessman #2'
 'Businessman #3' 'HCT' 'HCT Member #3' 'White' 'Boat Guy' 'Walt Jr.'
 'Senator Liptop' 'Business partner' 'Molly' 'Colin' 'Trevor'
 'Julius Irving' 'New Instant Message' 'Suit Store Father'
 'Athlead Employee' 'Dennis' 'Wade' 'Suit Store Son'
 'Female Athlead Employee' '3rd Athlead Employee' '4th Athlead Employee'
 'Co-worker' 'Co-worker #2' 'Mr. Romanko' 'Dance Teacher' 'Ballerinas'
 'Parent in Audience' 'Parent in audience #2' 'Parent in audience #1'
 'Investor' 'Lonnie' 'Fast Food Worker' 'Drive Thru Customer' 'Brian'
 'Cameraman' 'Rolf' 'Gabor' 'Zeke' 'Melvina' 'Wolf' 'Sensei Ira' 'Frank'
 'Party Announcer' 'Party Guest' 'Party Photographer' 'Party Waiter'
 'Nail stylist 1' 'Nail stylist 2' 'Nail manager' 'Shirley'
 'Athlead Coworker' 'Roger' 'Alice' "Oscar's Computer" 'Jeb'
 'German Minister' 'Fannie' 'Henry' 'Esther' 'Aunt Shirley' 'Cameron'
 'Promo Voice' 'Ryan Howard' 'Mr. Ruger' 'Ruger Sister 1' 'Salesmen'
 'Ruger Sister 2' 'Angela & Oscar' 'Reporter #1' 'Reporter #2'
 'Mrs. Davis' 'Carla Fern' 'Director' 'Producer'
 'Bob Vance, Vance Refrigeration' 'Production Assistant' 'Sensei' 'Philip'
 'Check-in guy' 'Casey' 'Mark McGrath' 'Jim & Dwight' 'Camera Crew'
 'Phillip' 'People in line' 'Santigold' 'Aaron Rodgers' 'Clay Aiken'
 'Camera Man' 'Malcolm' 'Casey Dean' 'Seth Mayers' 'Bill Hader' 'Dakota'
 'Stripper' 'Jakey' 'Man 1' 'Woman 1' 'Woman 2' 'Man 2' 'Moderator'
 'Man 3' 'Woman 3' 'Woman 4' 'Joan' 'Minister' 'Carol Stills']
Out[5]:
(775,)

Well, there are a lot of them - 775 to be exact. However, not all of them are unique characters: for example, "Andy & Michael" and "Andy and Michael" are counted as two speakers despite being the same pair. There are certainly more recurring duplicates caused by spelling errors, the way the script is written, and other factors.

In this analysis, I'm mainly focusing on the core characters of the series.

Analysis

Calculating Entropy

Let's start by cleaning the data.

In [6]:
def formatLine(line):
    # lowercase the text and strip everything except word characters and whitespace
    line = line.lower()
    line = re.sub(r'[^\w\s]', '', line)
    return line

office['line_formatted'] = office['line'].apply(formatLine)

office['line_formatted'].head()
Out[6]:
0    all right jim your quarterlies look very good ...
1                  oh i told you i couldnt close it so
2    so youve come to the master for guidance is th...
3              actually you called me in here but yeah
4          all right well let me show you how its done
Name: line_formatted, dtype: object

In this step, I've created a properly formatted version of each line. Using regex, I lowercased the text and removed unnecessary symbols such as commas and periods, because the further analysis concerns only letters and words.

Next, in order to calculate Shannon entropy, we need a suitable function. I defined a simple one myself, shown below.
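For reference, the quantity computed below is the standard Shannon entropy of the empirical symbol distribution, in bits:

$$H = -\sum_{x} p(x)\,\log_2 p(x)$$

where $p(x)$ is the relative frequency of symbol $x$ in the line.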

In [7]:
def entropy(text):
    # Shannon entropy (in bits) of the character distribution of the text
    counter = Counter(text)
    total = sum(counter.values())
    return -sum(count / total * log2(count / total) for count in counter.values())
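As a quick sanity check (an illustration added here, not an original notebook cell), two edge cases with known entropies:

# a single repeated symbol carries no information,
# two equally likely symbols carry exactly one bit
print(entropy('aaaa'))  # -0.0, i.e. zero bits
print(entropy('ab'))    # 1.0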

After defining the function, I applied it to the entire dataframe - to both formatted and not formatted lines, in order to see if it works properly.

In [8]:
office['entropy'] = office['line'].apply(entropy)

office['entropy_formatted'] = office['line_formatted'].apply(entropy)

office.head()
Out[8]:
season episode title scene speaker line line_formatted entropy entropy_formatted
0 1 1 Pilot 1 Michael All right Jim. Your quarterlies look very good... all right jim your quarterlies look very good ... 4.239712 3.999839
1 1 1 Pilot 1 Jim Oh, I told you. I couldn't close it. So... oh i told you i couldnt close it so 3.851149 3.364299
2 1 1 Pilot 1 Michael So you've come to the master for guidance? Is ... so youve come to the master for guidance is th... 4.245317 3.975739
3 1 1 Pilot 1 Jim Actually, you called me in here, but yeah. actually you called me in here but yeah 3.927418 3.675892
4 1 1 Pilot 1 Michael All right. Well, let me show you how it's done. all right well let me show you how its done 3.987594 3.695948

As expected, the formatted lines have lower entropy, as there are fewer symbols to cause the chaos.

Now, it's time to plot the entropy distribution.

In [9]:
# defining data for the plot
x0 = office['entropy']
x1 = office['entropy_formatted']

# starting the plot
fig = go.Figure()
fig.add_trace(go.Histogram(x=x0, name = 'Original'))
fig.add_trace(go.Histogram(x=x1, name = 'Formatted'))

# overlaying the histograms
fig.update_layout(barmode='overlay', title_text='Entropy of Original and Formatted Lines')
fig.update_xaxes(title_text='Entropy')
fig.update_yaxes(title_text='Count')

# reducing opacity for better visibility and showing the plot
fig.update_traces(opacity=0.75)
fig.show()
In [10]:
#saving the image
#fig.write_image('graphs/entropy_formatted_vs_not_formatted.png', engine="kaleido")

Original lines exhibit a broader range of entropy, including both very low and very high values, suggesting greater linguistic variability in unprocessed dialogue. In contrast, formatted lines show more concentrated peaks, particularly at lower entropy values (around 1 and 2), likely reflecting the effects of normalization or repetitive phrases. Both distributions converge in the higher entropy range (3.5–4.5), which likely corresponds to the rich, diverse dialogue characteristic of the show's humor and character interactions. This indicates that formatting reduces variability while preserving the core structure of the more dynamic, unpredictable lines central to the narrative flow.

Calculating mean entropy for each season of the show.

In [11]:
entropy_season_one = office[office['season'] == 1]['entropy'].mean()
entropy_season_two = office[office['season'] == 2]['entropy'].mean()
entropy_season_three = office[office['season'] == 3]['entropy'].mean()
entropy_season_four = office[office['season'] == 4]['entropy'].mean()
entropy_season_five = office[office['season'] == 5]['entropy'].mean()
entropy_season_six = office[office['season'] == 6]['entropy'].mean()
entropy_season_seven = office[office['season'] == 7]['entropy'].mean()
entropy_season_eight = office[office['season'] == 8]['entropy'].mean()
entropy_season_nine = office[office['season'] == 9]['entropy'].mean()

#creating a dataframe for the entropy by season

seasons_entropy = pd.DataFrame({'Season': ['1', '2', '3', '4', '5', '6', '7', '8', '9'],
                                'Entropy': [entropy_season_one, entropy_season_two, entropy_season_three, entropy_season_four, entropy_season_five, entropy_season_six, entropy_season_seven, entropy_season_eight, entropy_season_nine]})


#plotting the data 

fig = px.bar(seasons_entropy,
             x='Season',
             y='Entropy',
             title='Character Entropy by Season',
             color='Entropy',
             text_auto=True,
             range_color=[3.5, 3.7],)
fig.show()

Entropy values remain relatively stable throughout. Season 1 shows slightly higher entropy (3.658), suggesting a marginally more varied or complex distribution of characters compared to later seasons. Seasons 4 and 5 have slightly lower entropy (around 3.587), possibly due to stylistic or structural changes in the script. The near-uniformity across seasons suggests that the writing maintained a consistent level of textual variation, indicative of a stable narrative and linguistic style over time.
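As a hedged aside, the nine per-season variables above could be collapsed into a single groupby; a sketch of an equivalent computation:

# equivalent sketch: per-season mean entropy in one groupby call
seasons_entropy = (
    office.groupby('season', as_index=False)['entropy']
          .mean()
          .rename(columns={'season': 'Season', 'entropy': 'Entropy'})
)
seasons_entropy['Season'] = seasons_entropy['Season'].astype(str)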

In [12]:
#saving the image
#fig.write_image('graphs/entropy_by_season.png', engine="kaleido")

Creating a smaller dataframe for each major character.

In [13]:
michael = office[office['speaker'] == 'Michael']
dwight = office[office['speaker'] == 'Dwight']
jim = office[office['speaker'] == 'Jim']
pam = office[office['speaker'] == 'Pam']
andy = office[office['speaker'] == 'Andy']
toby = office[office['speaker'] == 'Toby']
stanley = office[office['speaker'] == 'Stanley']
kelly = office[office['speaker'] == 'Kelly']
ryan = office[office['speaker'] == 'Ryan']
phyllis = office[office['speaker'] == 'Phyllis']
oscar = office[office['speaker'] == 'Oscar']
darryl = office[office['speaker'] == 'Darryl']
jan = office[office['speaker'] == 'Jan']
creed = office[office['speaker'] == 'Creed']
meredith = office[office['speaker'] == 'Meredith']
angela = office[office['speaker'] == 'Angela']
kevin = office[office['speaker'] == 'Kevin']
erin = office[office['speaker'] == 'Erin']

#erin.head()
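As a side note, the same frames could be built in one pass with a dictionary comprehension; a sketch (the names main_cast and character_dfs are introduced here for illustration only):

# sketch: one dataframe per main character, keyed by name
main_cast = ['Michael', 'Dwight', 'Jim', 'Pam', 'Andy', 'Toby', 'Stanley',
             'Kelly', 'Ryan', 'Phyllis', 'Oscar', 'Darryl', 'Jan', 'Creed',
             'Meredith', 'Angela', 'Kevin', 'Erin']
character_dfs = {name: office[office['speaker'] == name] for name in main_cast}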

Calculating mean entropy for each character.

Firstly, regular script lines.

In [14]:
michael_entropy = michael['entropy'].mean()
dwight_entropy = dwight['entropy'].mean()
jim_entropy = jim['entropy'].mean()
pam_entropy = pam['entropy'].mean()
andy_entropy = andy['entropy'].mean()
toby_entropy = toby['entropy'].mean()
stanley_entropy = stanley['entropy'].mean()
kelly_entropy = kelly['entropy'].mean()
ryan_entropy = ryan['entropy'].mean()
phyllis_entropy = phyllis['entropy'].mean()
oscar_entropy = oscar['entropy'].mean()
darryl_entropy = darryl['entropy'].mean()
jan_entropy = jan['entropy'].mean()
creed_entropy = creed['entropy'].mean()
meredith_entropy = meredith['entropy'].mean()
angela_entropy = angela['entropy'].mean()
kevin_entropy = kevin['entropy'].mean()
erin_entropy = erin['entropy'].mean()

Now, formatted lines.

In [15]:
michael_entropy_formatted = michael['entropy_formatted'].mean()
dwight_entropy_formatted = dwight['entropy_formatted'].mean()
jim_entropy_formatted = jim['entropy_formatted'].mean()
pam_entropy_formatted = pam['entropy_formatted'].mean()
andy_entropy_formatted = andy['entropy_formatted'].mean()
toby_entropy_formatted = toby['entropy_formatted'].mean()
stanley_entropy_formatted = stanley['entropy_formatted'].mean()
kelly_entropy_formatted = kelly['entropy_formatted'].mean()
ryan_entropy_formatted = ryan['entropy_formatted'].mean()
phyllis_entropy_formatted = phyllis['entropy_formatted'].mean()
oscar_entropy_formatted = oscar['entropy_formatted'].mean()
darryl_entropy_formatted = darryl['entropy_formatted'].mean()
jan_entropy_formatted = jan['entropy_formatted'].mean()
creed_entropy_formatted = creed['entropy_formatted'].mean()
meredith_entropy_formatted = meredith['entropy_formatted'].mean()
angela_entropy_formatted = angela['entropy_formatted'].mean()
kevin_entropy_formatted = kevin['entropy_formatted'].mean()
erin_entropy_formatted = erin['entropy_formatted'].mean()

show_entropy = office['entropy'].mean()
show_entropy_formatted = office['entropy_formatted'].mean()
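Again as a hedged aside, both sets of means could come from a single groupby over the main cast (reusing the illustrative main_cast list from earlier):

# sketch: raw and formatted mean entropies for every main character at once
cast_means = (office[office['speaker'].isin(main_cast)]
              .groupby('speaker')[['entropy', 'entropy_formatted']]
              .mean())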

Comparing the two entropies.

In [16]:
fig = go.Figure(data=[
    go.Bar(name='Raw Text Entropy', x=['Michael','Dwight','Jim','Pam','Andy','Toby','Stanley','Kelly','Ryan','Phyllis','Oscar','Darryl','Jan','Creed','Meredith','Angela','Kevin','Erin'], y=[michael_entropy,dwight_entropy,jim_entropy,pam_entropy,andy_entropy,toby_entropy,stanley_entropy,kelly_entropy,ryan_entropy,phyllis_entropy,oscar_entropy,darryl_entropy,jan_entropy,creed_entropy,meredith_entropy,angela_entropy,kevin_entropy,erin_entropy]),
    go.Bar(name='Formatted Text Entropy', x=['Michael','Dwight','Jim','Pam','Andy','Toby','Stanley','Kelly','Ryan','Phyllis','Oscar','Darryl','Jan','Creed','Meredith','Angela','Kevin','Erin'], y=[michael_entropy_formatted,dwight_entropy_formatted,jim_entropy_formatted,pam_entropy_formatted,andy_entropy_formatted,toby_entropy_formatted,stanley_entropy_formatted,kelly_entropy_formatted,ryan_entropy_formatted,phyllis_entropy_formatted,oscar_entropy_formatted,darryl_entropy_formatted,jan_entropy_formatted,creed_entropy_formatted,meredith_entropy_formatted,angela_entropy_formatted,kevin_entropy_formatted,erin_entropy_formatted])
])

# Change the bar mode
fig.update_layout(barmode='group')
fig.update_layout(title='Character Entropy - raw versus formatted text', yaxis_title='Entropy', xaxis_title='Character')
fig.show()

Across all characters, raw text entropy (blue bars) is consistently higher than formatted text entropy (red bars), reflecting that formatting reduces variability in character distributions, as stated before. The differences between raw and formatted entropy are relatively uniform across characters, indicating that the formatting process has a consistent impact regardless of individual textual distributions. Overall, this suggests that while the raw text retains more nuanced diversity, formatting simplifies the distribution while still preserving similar overall patterns for each character.

In [17]:
#saving the image
#fig.write_image('graphs/entropy_by_character_formatted_vs_unformatted.png', engine="kaleido")

Let's see which characters had the highest entropy.

In [18]:
#creating a dataframe out of the formatted values

formatted_entropies = pd.DataFrame({'speaker': ['Show', 'Michael','Dwight','Jim','Pam','Andy','Toby','Stanley','Kelly','Ryan','Phyllis','Oscar','Darryl','Jan','Creed','Meredith','Angela','Kevin','Erin'],
                                    'Entropy': [show_entropy, michael_entropy,dwight_entropy,jim_entropy,pam_entropy,andy_entropy,toby_entropy,stanley_entropy,kelly_entropy,ryan_entropy,phyllis_entropy,oscar_entropy,darryl_entropy,jan_entropy,creed_entropy,meredith_entropy,angela_entropy,kevin_entropy,erin_entropy],
                                    'entropy_formatted': [show_entropy_formatted, michael_entropy_formatted,dwight_entropy_formatted,jim_entropy_formatted,pam_entropy_formatted,andy_entropy_formatted,toby_entropy_formatted,stanley_entropy_formatted,kelly_entropy_formatted,ryan_entropy_formatted,phyllis_entropy_formatted,oscar_entropy_formatted,darryl_entropy_formatted,jan_entropy_formatted,creed_entropy_formatted,meredith_entropy_formatted,angela_entropy_formatted,kevin_entropy_formatted,erin_entropy_formatted]})
formatted_entropies

#sorting the speakers by entropy

formatted_entropies = formatted_entropies.sort_values(by='Entropy', ascending=False)

fig = px.bar(formatted_entropies,
             x='speaker',
             y='Entropy',
             title='Character Entropy by Speaker',
             color = 'Entropy',
             text_auto = True)

fig.update_layout(yaxis_title='Entropy', xaxis_title='Speaker')

fig.add_shape(
    name="show",
    showlegend=False,
    type="rect",
    line=dict(dash="dash"),
    x0=7.4,
    x1=6.6,
    y0=0,
    y1=3.63,
)

fig.show()

Kelly exhibits the highest entropy (3.704), indicating a broader diversity in letter usage within her dialogue, while Kevin has the lowest entropy (3.499), reflecting a more limited variety of characters in his lines. The entropy values across speakers remain relatively close, suggesting consistent linguistic patterns in the script, with minor variations likely driven by unique speech styles or vocabulary choices tied to individual characters. The inclusion of a combined "Show" entropy value (3.623) provides a baseline for overall textual diversity across all speakers.

Summing up, the mean entropy is lower in every case after formatting the text. The n-gram and word entropy analyses later in this report therefore work on the formatted lines; the per-season breakdowns directly below still use the raw-line entropy column.

In [19]:
#saving the image
#fig.write_image('graphs/entropy_entropy_by_character_sorted.png', engine="kaleido")

Now, let's calculate how the entropy for each of the characters has changed over the seasons.

Step one: Michael

In [20]:
michael_entropy_season_one = michael[michael['season'] == 1]['entropy'].mean()
michael_entropy_season_two = michael[michael['season'] == 2]['entropy'].mean()
michael_entropy_season_three = michael[michael['season'] == 3]['entropy'].mean()
michael_entropy_season_four = michael[michael['season'] == 4]['entropy'].mean()
michael_entropy_season_five = michael[michael['season'] == 5]['entropy'].mean()
michael_entropy_season_six = michael[michael['season'] == 6]['entropy'].mean()
michael_entropy_season_seven = michael[michael['season'] == 7]['entropy'].mean()

#as Michael was present for (sadly) only 7 seasons, it's not necessary to consider the rest of the seasons for him

#creating a dataframe for the entropy of Michael's lines by season

michael_entropies = pd.DataFrame({'Season': ['1', '2', '3', '4', '5', '6', '7'],
                                'Entropy': [michael_entropy_season_one, michael_entropy_season_two, michael_entropy_season_three, michael_entropy_season_four, michael_entropy_season_five, michael_entropy_season_six, michael_entropy_season_seven]})

#plotting the entropy of Michael's lines over the seasons

fig = px.bar(michael_entropies,
                x='Season',
                y='Entropy',
                title='Michael\'s Character Entropy by Season',
                color='Entropy',
                range_color=[3.4, 3.9],
                text_auto = True)
fig.show()

Michael's character (letter) entropy declines from Season 1 (3.856) to Season 5 (3.630), followed by a slight recovery in later seasons. The high entropy in Season 1 indicates a more varied and experimental linguistic style in Michael's dialogue early in the series - the first season closely followed the UK version of "The Office" - and the decline reflects how his persona was toned down after Season 1. The decrease in subsequent seasons suggests a gradual simplification and standardization of his dialogue, aligning with the character's already established comedic tone. The stabilization in Seasons 6 and 7 (around 3.68) reflects consistency in the textual patterns of Michael's lines as the show matured.
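The same per-season recipe repeats below for Dwight, Jim, and Pam (and again later for word entropy), so a small helper like this sketch could replace the one-variable-per-season pattern; season_means is a name introduced here for illustration only:

# sketch: per-season mean of any entropy column for one character's dataframe
def season_means(df, column='entropy'):
    out = df.groupby('season', as_index=False)[column].mean()
    out.columns = ['Season', 'Entropy']
    out['Season'] = out['Season'].astype(str)
    return out

# e.g. michael_entropies = season_means(michael)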

In [21]:
#saving the image
#fig.write_image('graphs/entropy_michael_by_season.png', engine="kaleido")

Dwight

In [22]:
dwight_entropy_season_one = dwight[dwight['season'] == 1]['entropy'].mean()
dwight_entropy_season_two = dwight[dwight['season'] == 2]['entropy'].mean()
dwight_entropy_season_three = dwight[dwight['season'] == 3]['entropy'].mean()
dwight_entropy_season_four = dwight[dwight['season'] == 4]['entropy'].mean()
dwight_entropy_season_five = dwight[dwight['season'] == 5]['entropy'].mean()
dwight_entropy_season_six = dwight[dwight['season'] == 6]['entropy'].mean()
dwight_entropy_season_seven = dwight[dwight['season'] == 7]['entropy'].mean()
dwight_entropy_season_eight = dwight[dwight['season'] == 8]['entropy'].mean()
dwight_entropy_season_nine = dwight[dwight['season'] == 9]['entropy'].mean()

#creating a dataframe for the entropy of Dwight's lines by season

dwight_entropies = pd.DataFrame({'Season': ['1', '2', '3', '4', '5', '6', '7', '8', '9'],
                                'Entropy': [dwight_entropy_season_one, dwight_entropy_season_two, dwight_entropy_season_three, dwight_entropy_season_four, dwight_entropy_season_five, dwight_entropy_season_six, dwight_entropy_season_seven, dwight_entropy_season_eight, dwight_entropy_season_nine]})

#plotting the entropy of Dwight's lines over the seasons

fig = px.bar(dwight_entropies,
                x='Season',
                y='Entropy',
                title='Dwight\'s Character Entropy by Season',
                color='Entropy',
                range_color=[3.4, 3.9],
                text_auto = True)
fig.show()

Dwight's entropy starts at 3.669 in Season 1 and slightly decreases through Seasons 2 to 4, reaching its lowest point at 3.599 in Season 4. From Season 5 onwards, entropy steadily increases, peaking at 3.745 in Season 9. This upward trend in later seasons suggests an increase in linguistic diversity and complexity in Dwight's dialogue, reflecting character development and changes in his role within the show's narrative. The stable yet slight variations across seasons highlight consistent textual patterns in Dwight's speech, with nuanced shifts over time.

In [23]:
#saving the image
#fig.write_image('graphs/entropy_dwight_by_season.png', engine="kaleido")

Jim

In [24]:
jim_entropy_season_one = jim[jim['season'] == 1]['entropy'].mean()
jim_entropy_season_two = jim[jim['season'] == 2]['entropy'].mean()
jim_entropy_season_three = jim[jim['season'] == 3]['entropy'].mean()
jim_entropy_season_four = jim[jim['season'] == 4]['entropy'].mean()
jim_entropy_season_five = jim[jim['season'] == 5]['entropy'].mean()
jim_entropy_season_six = jim[jim['season'] == 6]['entropy'].mean()
jim_entropy_season_seven = jim[jim['season'] == 7]['entropy'].mean()
jim_entropy_season_eight = jim[jim['season'] == 8]['entropy'].mean()
jim_entropy_season_nine = jim[jim['season'] == 9]['entropy'].mean()

#creating a dataframe for the entropy of Jim's lines by season

jim_entropies = pd.DataFrame({'Season': ['1', '2', '3', '4', '5', '6', '7', '8', '9'],
                              'Entropy': [jim_entropy_season_one, jim_entropy_season_two, jim_entropy_season_three, jim_entropy_season_four, jim_entropy_season_five, jim_entropy_season_six, jim_entropy_season_seven, jim_entropy_season_eight, jim_entropy_season_nine]})

#plotting the entropy of Jim's lines over the seasons

fig = px.bar(jim_entropies,
                x='Season',
                y='Entropy',
                title='Jim\'s Character Entropy by Season',
                color='Entropy',
                range_color=[3.4, 3.9],
                text_auto = True)
fig.show()

For Jim, Season 1 starts with relatively high entropy (3.649), indicating a diverse range of letter usage in his dialogue. Entropy declines slightly in Seasons 2 and 3, reaching its lowest point in Season 3 (3.458), which reflects a more consistent and streamlined textual style during that period. Entropy stabilizes from Seasons 4 to 7, before showing a gradual increase in Seasons 8 and 9, peaking at 3.695 in the final season. This rise in later seasons shows his real character development, and the low point of Season 3 is a great representation of how his character was feeling during that time - lost, confused and set aside.

In [25]:
#saving the image
#fig.write_image('graphs/entropy_jim_by_season.png', engine="kaleido")

Pam

In [26]:
pam_entropy_season_one = pam[pam['season'] == 1]['entropy'].mean()
pam_entropy_season_two = pam[pam['season'] == 2]['entropy'].mean()
pam_entropy_season_three = pam[pam['season'] == 3]['entropy'].mean()
pam_entropy_season_four = pam[pam['season'] == 4]['entropy'].mean()
pam_entropy_season_five = pam[pam['season'] == 5]['entropy'].mean()
pam_entropy_season_six = pam[pam['season'] == 6]['entropy'].mean()
pam_entropy_season_seven = pam[pam['season'] == 7]['entropy'].mean()
pam_entropy_season_eight = pam[pam['season'] == 8]['entropy'].mean()
pam_entropy_season_nine = pam[pam['season'] == 9]['entropy'].mean()

#creating a dataframe for the entropy of Pam's lines by season

pam_entropies = pd.DataFrame({'Season': ['1', '2', '3', '4', '5', '6', '7', '8', '9'],
                                'Entropy': [pam_entropy_season_one, pam_entropy_season_two, pam_entropy_season_three, pam_entropy_season_four, pam_entropy_season_five, pam_entropy_season_six, pam_entropy_season_seven, pam_entropy_season_eight, pam_entropy_season_nine]})

#plotting the entropy of Pam's lines over the seasons

fig = px.bar(pam_entropies,
                x='Season',
                y='Entropy',
                title='Pam\'s Character Entropy by Season',
                color='Entropy',
                range_color=[3.4, 3.9],
                text_auto = True)
fig.show()

Starting at 3.399 in Season 1, entropy gradually increases over the course of the series, reaching 3.611 in Season 9. This progression suggests that Pam’s dialogue became more diverse and complex as her character developed and her narrative role expanded. The increase is steady, with noticeable jumps in later seasons (e.g., Season 7 at 3.656 and Season 8 at 3.605), indicating a consistent evolution in linguistic patterns. Overall, the trend reflects Pam’s growing presence and a shift toward more dynamic dialogue in her interactions.

In [27]:
#saving the image
#fig.write_image('graphs/entropy_pam_by_season.png', engine="kaleido")

Calculating N-gram Entropy

In this section, instead of calculating character (letter) entropies, I will focus on n-gram entropies.

First, let's start by defining a function to calculate n-gram entropies.

In [28]:
def text_to_n_gram_sequence(text, n):
    # maps each distinct overlapping n-gram to an integer id and
    # returns the text encoded as the sequence of those ids
    sequence = []
    n_gram_dict = {}
    next_key = 0

    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        if gram not in n_gram_dict:
            n_gram_dict[gram] = next_key
            next_key += 1
        sequence.append(n_gram_dict[gram])

    return sequence
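A quick illustration of what the function returns (added here for clarity, not an original cell): the overlapping 2-grams of 'abab' are 'ab', 'ba', 'ab', which map to the ids 0, 1, 0 - and entropy() can consume that id sequence directly, since Counter accepts any iterable.

# 2-grams of 'abab': 'ab' -> 0, 'ba' -> 1, 'ab' -> 0
print(text_to_n_gram_sequence('abab', 2))           # [0, 1, 0]
print(entropy(text_to_n_gram_sequence('abab', 2)))  # ~0.918 bits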

Now, time to create a corpus for each character.

In [29]:
michael_corpus = ' '.join(michael['line_formatted'])
dwight_corpus = ' '.join(dwight['line_formatted'])
jim_corpus = ' '.join(jim['line_formatted'])
pam_corpus = ' '.join(pam['line_formatted'])
andy_corpus = ' '.join(andy['line_formatted'])
toby_corpus = ' '.join(toby['line_formatted'])
stanley_corpus = ' '.join(stanley['line_formatted'])
kelly_corpus = ' '.join(kelly['line_formatted'])
ryan_corpus = ' '.join(ryan['line_formatted'])
phyllis_corpus = ' '.join(phyllis['line_formatted'])
oscar_corpus = ' '.join(oscar['line_formatted'])
darryl_corpus = ' '.join(darryl['line_formatted'])
jan_corpus = ' '.join(jan['line_formatted'])
creed_corpus = ' '.join(creed['line_formatted'])
meredith_corpus = ' '.join(meredith['line_formatted'])
angela_corpus = ' '.join(angela['line_formatted'])
kevin_corpus = ' '.join(kevin['line_formatted'])
erin_corpus = ' '.join(erin['line_formatted'])
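With the illustrative character_dfs dictionary sketched earlier, all of these corpora would be a one-liner:

# sketch: all character corpora at once, assuming the character_dfs dict from before
corpora = {name: ' '.join(df['line_formatted']) for name, df in character_dfs.items()}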
In [30]:
#checking the first 150 characters of Michael's corpus
print(michael_corpus[:150])
#success
all right jim your quarterlies look very good how are things at the library so youve come to the master for guidance is this what youre saying grassho

And for the whole series.

In [31]:
office_corpus = ' '.join(office['line_formatted'])
In [32]:
#checking the first 150 characters of the whole-series corpus
print(office_corpus[:150])
#success
all right jim your quarterlies look very good how are things at the library oh i told you i couldnt close it so so youve come to the master for guidan

Now, let's get the entropies for the four main characters.

In [33]:
entropy_michael = [entropy(text_to_n_gram_sequence(michael_corpus, i)) for i in range(1, 20)]
entropy_dwight = [entropy(text_to_n_gram_sequence(dwight_corpus, i)) for i in range(1, 20)]
entropy_jim = [entropy(text_to_n_gram_sequence(jim_corpus, i)) for i in range(1, 20)]
entropy_pam = [entropy(text_to_n_gram_sequence(pam_corpus, i)) for i in range(1, 20)]

And for the whole script.

In [34]:
entropy_office = [entropy(text_to_n_gram_sequence(office_corpus, i)) for i in range(1, 20)]

And for all of the major characters.

In [35]:
entropy_andy = [entropy(text_to_n_gram_sequence(andy_corpus, i)) for i in range(1, 20)]
entropy_toby = [entropy(text_to_n_gram_sequence(toby_corpus, i)) for i in range(1, 20)]
entropy_stanley = [entropy(text_to_n_gram_sequence(stanley_corpus, i)) for i in range(1, 20)]
entropy_kelly = [entropy(text_to_n_gram_sequence(kelly_corpus, i)) for i in range(1, 20)]
entropy_ryan = [entropy(text_to_n_gram_sequence(ryan_corpus, i)) for i in range(1, 20)]
entropy_phyllis = [entropy(text_to_n_gram_sequence(phyllis_corpus, i)) for i in range(1, 20)]
entropy_oscar = [entropy(text_to_n_gram_sequence(oscar_corpus, i)) for i in range(1, 20)]
entropy_darryl = [entropy(text_to_n_gram_sequence(darryl_corpus, i)) for i in range(1, 20)]
entropy_jan = [entropy(text_to_n_gram_sequence(jan_corpus, i)) for i in range(1, 20)]
entropy_creed = [entropy(text_to_n_gram_sequence(creed_corpus, i)) for i in range(1, 20)]
entropy_meredith = [entropy(text_to_n_gram_sequence(meredith_corpus, i)) for i in range(1, 20)]
entropy_angela = [entropy(text_to_n_gram_sequence(angela_corpus, i)) for i in range(1, 20)]
entropy_kevin = [entropy(text_to_n_gram_sequence(kevin_corpus, i)) for i in range(1, 20)]
entropy_erin = [entropy(text_to_n_gram_sequence(erin_corpus, i)) for i in range(1, 20)]

Now, let's plot the entropies for the characters.

In [36]:
fig = make_subplots(rows=2, cols=2, subplot_titles=('Michael', 'Dwight', 'Jim', 'Pam'))

#michael
fig.add_trace(go.Scatter(x=list(range(1, 20)), y=entropy_michael, mode='lines', name='Michael'), row=1, col=1)
fig.update_xaxes(type='log', row=1, col=1)

#dwight
fig.add_trace(go.Scatter(x=list(range(1, 20)), y=entropy_dwight, mode='lines', name='Dwight'), row=1, col=2)
fig.update_xaxes(type='log', row=1, col=2)

#jim
fig.add_trace(go.Scatter(x=list(range(1, 20)), y=entropy_jim, mode='lines', name='Jim'), row=2, col=1)
fig.update_xaxes(type='log', row=2, col=1)

#pam
fig.add_trace(go.Scatter(x=list(range(1, 20)), y=entropy_pam, mode='lines', name='Pam'), row=2, col=2)
fig.update_xaxes(type='log', row=2, col=2)

#show the plot
fig.update_layout(height=800, width=800, title_text='Entropies of characters\' lines by n-gram')
fig.show()
In [37]:
#saving the image
#fig.write_image('graphs/entropy_n_grams_main_characters.png', engine="kaleido")
In [38]:
#the show
fig = px.scatter(x=list(range(1, 20)),
                 y=entropy_office,
                 title='Entropy of The Office by n-gram',
                 labels={'x': 'n-gram', 'y': 'Entropy'},
                 color = entropy_office,)


# log-axis ranges in plotly are specified in log10 units
fig.update_xaxes(type='log', range=[np.log10(1), np.log10(25)])

fig.show()
In [39]:
#saving the image
#fig.write_image('graphs/entropy_n_grams_whole_series.png', engine="kaleido")

Good-looking, but hard to compare across panels, so let's put everything on one plot.

In [40]:
#let's plot them on the same graph

fig = go.Figure()

fig.add_trace(go.Scatter(x=list(range(1, 20)), y=entropy_michael, mode='lines', name='Michael'))
fig.add_trace(go.Scatter(x=list(range(1, 20)), y=entropy_dwight, mode='lines', name='Dwight'))
fig.add_trace(go.Scatter(x=list(range(1, 20)), y=entropy_jim, mode='lines', name='Jim'))
fig.add_trace(go.Scatter(x=list(range(1, 20)), y=entropy_pam, mode='lines', name='Pam'))
fig.add_trace(go.Scatter(x=list(range(1, 20)), y=entropy_office, mode='lines', name='Show'))

fig.update_xaxes(type='log')
fig.update_layout(title='Entropies of characters\' lines by n-gram')
fig.show()

The "Show" curve consistently has the highest entropy, reflecting the combined linguistic diversity across all characters and scenes. Michael’s lines display the highest entropy among individual characters, indicating his dialogue features greater variation and unpredictability, aligning with his dynamic and often eccentric personality. Dwight, Jim, and Pam show slightly lower and closely aligned entropy curves, suggesting more structured and predictable dialogue patterns, consistent with their roles as supporting or more grounded characters. The gradual flattening of all curves for higher n-grams reflects the diminishing increase in diversity as longer phrases are considered.

In [41]:
#saving the image
#fig.write_image('graphs/entropy_n_grams_main_characters_all_same_graph.png', engine="kaleido")

And now, let's calculate mean n-gram entropy for each character.

In [42]:
mean_ngram_entropy_michael = np.round(np.mean(entropy_michael), 2)
mean_ngram_entropy_dwight = np.round(np.mean(entropy_dwight), 2)
mean_ngram_entropy_jim = np.round(np.mean(entropy_jim), 2)
mean_ngram_entropy_pam = np.round(np.mean(entropy_pam), 2)
mean_ngram_entropy_andy = np.round(np.mean(entropy_andy), 2)
mean_ngram_entropy_toby = np.round(np.mean(entropy_toby), 2)
mean_ngram_entropy_stanley = np.round(np.mean(entropy_stanley), 2)
mean_ngram_entropy_kelly = np.round(np.mean(entropy_kelly), 2)
mean_ngram_entropy_ryan = np.round(np.mean(entropy_ryan), 2)
mean_ngram_entropy_phyllis = np.round(np.mean(entropy_phyllis), 2)
mean_ngram_entropy_oscar = np.round(np.mean(entropy_oscar), 2)
mean_ngram_entropy_darryl = np.round(np.mean(entropy_darryl), 2)
mean_ngram_entropy_jan = np.round(np.mean(entropy_jan), 2)
mean_ngram_entropy_creed = np.round(np.mean(entropy_creed), 2)
mean_ngram_entropy_meredith = np.round(np.mean(entropy_meredith), 2)
mean_ngram_entropy_angela = np.round(np.mean(entropy_angela), 2)
mean_ngram_entropy_kevin = np.round(np.mean(entropy_kevin), 2)
mean_ngram_entropy_erin = np.round(np.mean(entropy_erin), 2)

mean_ngram_entropy_show = np.round(np.mean(entropy_office), 2)

#creating a dataframe

mean_ngram_entropies = pd.DataFrame({'speaker': ['Show', 'Michael','Dwight','Jim','Pam','Andy','Toby','Stanley','Kelly','Ryan','Phyllis','Oscar','Darryl','Jan','Creed','Meredith','Angela','Kevin','Erin'],
                                    'Mean n-gram entropy': [mean_ngram_entropy_show, mean_ngram_entropy_michael,mean_ngram_entropy_dwight,mean_ngram_entropy_jim,mean_ngram_entropy_pam,mean_ngram_entropy_andy,mean_ngram_entropy_toby,mean_ngram_entropy_stanley,mean_ngram_entropy_kelly,mean_ngram_entropy_ryan,mean_ngram_entropy_phyllis,mean_ngram_entropy_oscar,mean_ngram_entropy_darryl,mean_ngram_entropy_jan,mean_ngram_entropy_creed,mean_ngram_entropy_meredith,mean_ngram_entropy_angela,mean_ngram_entropy_kevin,mean_ngram_entropy_erin]})

mean_ngram_entropies = mean_ngram_entropies.sort_values(by='Mean n-gram entropy', ascending=False)

fig = px.bar(mean_ngram_entropies,
             title = 'Mean n-gram Entropy by Speaker',
             x='speaker',
             y='Mean n-gram entropy',
             color='Mean n-gram entropy',
             text_auto=True)

fig.update_layout(yaxis_title='Mean n-gram Entropy', xaxis_title='Speaker')

fig.add_shape(
    name="show",
    showlegend=False,
    type="rect",
    line=dict(dash="dash"),
    x0=-0.4,
    x1=0.4,
    y0=0,
    y1=17.13,
)


fig.show()

The "Show" value, representing the combined entropy across all speakers, is the highest at 17.13, once again showing that the combined entropy across all speakers is higher and that of individuals. Among individual characters, Michael ranks highest at 16.03, followed by Dwight (15.68), Jim (15.23), and Andy (15.16), suggesting their dialogue exhibits the greatest diversity and unpredictability. Pam follows closely at 15.04, with most other characters, including Angela, Erin, and (surprisingly) Kevin, clustering around 13.9. Creed has the lowest mean entropy at 12.79, which could indicate that his line are the most predictable - which is not the case. This probably stems from the fact that his character's dialogues and scenes are sparse.

Overall, this distribution does highlight significant differences in the complexity of dialogue assigned to characters, aligning with their narrative roles and personality traits. Michael’s high entropy reflects his dynamic and erratic personality, while Creed’s low entropy suggests his minimalistic and idiosyncratic contributions.

In [43]:
#saving the image
#fig.write_image('graphs/entropy_n_grams_mean_by_speaker.png', engine="kaleido")

Calculating Word Entropy

Once again, time to define a function!

In [44]:
def word_entropy(text):
    # Shannon entropy (in bits) of the word (token) distribution of the text
    words = text.split()
    counter = Counter(words)
    total = sum(counter.values())
    return -sum(count / total * log2(count / total) for count in counter.values())
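A small sanity check (added for illustration): four distinct words are four equally likely outcomes, so the entropy should be log2(4) = 2 bits.

# four distinct words, each with probability 1/4 -> 2 bits
print(word_entropy('to be or not'))  # 2.0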

Now, time to apply it to the entire dataframe.

In [45]:
office['word_entropy'] = office['line_formatted'].apply(word_entropy)

office.head()
#success
Out[45]:
season episode title scene speaker line line_formatted entropy entropy_formatted word_entropy
0 1 1 Pilot 1 Michael All right Jim. Your quarterlies look very good... all right jim your quarterlies look very good ... 4.239712 3.999839 3.807355
1 1 1 Pilot 1 Jim Oh, I told you. I couldn't close it. So... oh i told you i couldnt close it so 3.851149 3.364299 2.947703
2 1 1 Pilot 1 Michael So you've come to the master for guidance? Is ... so youve come to the master for guidance is th... 4.245317 3.975739 3.807355
3 1 1 Pilot 1 Jim Actually, you called me in here, but yeah. actually you called me in here but yeah 3.927418 3.675892 3.000000
4 1 1 Pilot 1 Michael All right. Well, let me show you how it's done. all right well let me show you how its done 3.987594 3.695948 3.321928

Creating dataframes for characters. This step could be skipped, as it is a 1:1 copy of the dataframe creation done in the "character entropy" part. However, since this report shows the step-by-step process of the exploratory analysis, the dataframes are created once more after applying the word_entropy function.

In [46]:
#re-creating the dataframes after adding the word_entropy column
michael = office[office['speaker'] == 'Michael']
dwight = office[office['speaker'] == 'Dwight']
jim = office[office['speaker'] == 'Jim']
pam = office[office['speaker'] == 'Pam']
andy = office[office['speaker'] == 'Andy']
toby = office[office['speaker'] == 'Toby']
stanley = office[office['speaker'] == 'Stanley']
kelly = office[office['speaker'] == 'Kelly']
ryan = office[office['speaker'] == 'Ryan']
phyllis = office[office['speaker'] == 'Phyllis']
oscar = office[office['speaker'] == 'Oscar']
darryl = office[office['speaker'] == 'Darryl']
jan = office[office['speaker'] == 'Jan']
creed = office[office['speaker'] == 'Creed']
meredith = office[office['speaker'] == 'Meredith']
angela = office[office['speaker'] == 'Angela']
kevin = office[office['speaker'] == 'Kevin']
erin = office[office['speaker'] == 'Erin']

#calculating mean word entropy for each character
michael_word_entropy = michael['word_entropy'].mean()
dwight_word_entropy = dwight['word_entropy'].mean()
jim_word_entropy = jim['word_entropy'].mean()
pam_word_entropy = pam['word_entropy'].mean()
andy_word_entropy = andy['word_entropy'].mean()
toby_word_entropy = toby['word_entropy'].mean()
stanley_word_entropy = stanley['word_entropy'].mean()
kelly_word_entropy = kelly['word_entropy'].mean()
ryan_word_entropy = ryan['word_entropy'].mean()
phyllis_word_entropy = phyllis['word_entropy'].mean()
oscar_word_entropy = oscar['word_entropy'].mean()
darryl_word_entropy = darryl['word_entropy'].mean()
jan_word_entropy = jan['word_entropy'].mean()
creed_word_entropy = creed['word_entropy'].mean()
meredith_word_entropy = meredith['word_entropy'].mean()
angela_word_entropy = angela['word_entropy'].mean()
kevin_word_entropy = kevin['word_entropy'].mean()
erin_word_entropy = erin['word_entropy'].mean()

#calculating word entropy for the entire show
office_word_entropy = office['word_entropy'].mean()

#print(office_word_entropy)

Creating a dataframe.

In [47]:
word_entropies = pd.DataFrame({'speaker': ['Show','Michael','Dwight','Jim','Pam','Andy','Toby','Stanley','Kelly','Ryan','Phyllis','Oscar','Darryl','Jan','Creed','Meredith','Angela','Kevin'],
                                'Word Entropy': [office_word_entropy,michael_word_entropy,dwight_word_entropy,jim_word_entropy,pam_word_entropy,andy_word_entropy,toby_word_entropy,stanley_word_entropy,kelly_word_entropy,ryan_word_entropy,phyllis_word_entropy,oscar_word_entropy,darryl_word_entropy,jan_word_entropy,creed_word_entropy,meredith_word_entropy,angela_word_entropy,kevin_word_entropy]})

#sorting the speakers by word entropy
word_entropies = word_entropies.sort_values(by='Word Entropy', ascending=False)

fig = px.bar(word_entropies,
                x='speaker',
                y='Word Entropy',
                title='Mean Word Entropy by Speaker',
                color = 'Word Entropy',
                text_auto = True,
)

fig.update_layout(yaxis_title='Word Entropy', xaxis_title='Speaker')

fig.add_shape(
    name="show",
    showlegend=False,
    type="rect",
    line=dict(dash="dash"),
    x0=5.4,
    x1=4.6,
    y0=0,
    y1=2.5,
)

fig.show()

Michael leads with the highest entropy (2.74), indicating his lines are linguistically diverse and unpredictable, aligning with his dynamic and erratic persona. Andy (2.68) and Kelly (2.67) follow closely, reflecting their often varied and distinctive speech styles. In contrast, Kevin has the lowest entropy (2.21), consistent with his simplistic and formulaic dialogue delivery, which is part of his comedic characterization. The "Show" entropy at 2.50 represents the average diversity across all speakers, while Creed's relatively high entropy (2.56) is notable given his concise and idiosyncratic lines, suggesting that while his dialogue is short, it remains varied.

In [48]:
#saving the image#
#fig.write_image('graphs/entropy_word_by_speaker.png', engine="kaleido")
In [49]:
#calculating mean word entropy by season

word_entropy_season_one = office[office['season'] == 1]['word_entropy'].mean()
word_entropy_season_two = office[office['season'] == 2]['word_entropy'].mean()
word_entropy_season_three = office[office['season'] == 3]['word_entropy'].mean()
word_entropy_season_four = office[office['season'] == 4]['word_entropy'].mean()
word_entropy_season_five = office[office['season'] == 5]['word_entropy'].mean()
word_entropy_season_six = office[office['season'] == 6]['word_entropy'].mean()
word_entropy_season_seven = office[office['season'] == 7]['word_entropy'].mean()
word_entropy_season_eight = office[office['season'] == 8]['word_entropy'].mean()
word_entropy_season_nine = office[office['season'] == 9]['word_entropy'].mean()

#creating a dataframe for the word entropy by season

seasons_word_entropy = pd.DataFrame({'Season': ['1', '2', '3', '4', '5', '6', '7', '8', '9'],
                                'Word Entropy': [word_entropy_season_one, word_entropy_season_two, word_entropy_season_three, word_entropy_season_four, word_entropy_season_five, word_entropy_season_six, word_entropy_season_seven, word_entropy_season_eight, word_entropy_season_nine]})

#plotting the data

fig = px.bar(seasons_word_entropy,
                x='Season',
                y='Word Entropy',
                title='Word Entropy by Season',
                color='Word Entropy',
                text_auto=True,)
fig.show()

Season 1 starts with relatively high entropy (2.55), indicating a diverse range of vocabulary, but entropy declines slightly in Seasons 2 (2.44) and 3 (2.46), suggesting a more streamlined or consistent linguistic style during these seasons. From Season 4 onward, entropy gradually increases, peaking in Season 9 (2.57), reflecting a return to more diverse and varied word usage. This trend could align with the show’s evolving narrative complexity and character development, as later seasons often expanded dialogue styles and character interactions. The consistent rise in entropy after Season 5 suggests a revitalization in scriptwriting or storytelling approach in the latter half of the series.

In [50]:
#saving the image
#fig.write_image('graphs/entropy_word_by_season.png', engine="kaleido")

Calculating mean word entropy by season for each major character.

In [51]:
michael_word_entropy_season_one = michael[michael['season'] == 1]['word_entropy'].mean()
michael_word_entropy_season_two = michael[michael['season'] == 2]['word_entropy'].mean()
michael_word_entropy_season_three = michael[michael['season'] == 3]['word_entropy'].mean()
michael_word_entropy_season_four = michael[michael['season'] == 4]['word_entropy'].mean()
michael_word_entropy_season_five = michael[michael['season'] == 5]['word_entropy'].mean()
michael_word_entropy_season_six = michael[michael['season'] == 6]['word_entropy'].mean()
michael_word_entropy_season_seven = michael[michael['season'] == 7]['word_entropy'].mean()

#creating a data frame

michael_word_entropies = pd.DataFrame({'season': ['Season 1', 'Season 2', 'Season 3', 'Season 4', 'Season 5', 'Season 6', 'Season 7'],
                                        'Word Entropy': [michael_word_entropy_season_one, michael_word_entropy_season_two, michael_word_entropy_season_three, michael_word_entropy_season_four, michael_word_entropy_season_five, michael_word_entropy_season_six, michael_word_entropy_season_seven]})

fig = px.bar(michael_word_entropies,
                x='season',
                y='Word Entropy',
                title='Michael\'s Word Entropy by Season',
                color = 'Word Entropy',
                text_auto = True,
                range_color=[2, 3.1]
    )

fig.update_layout(yaxis_title='Word Entropy', xaxis_title='Season')

fig.show()
                                       

Season 1 shows the highest entropy (3.08), indicating a broader range of word usage, potentially reflecting Michael's dynamic and evolving role in establishing the show's tone. From Season 2 to Season 5, entropy steadily decreases, reaching its lowest point in Season 5 (2.61). This decline suggests a possible standardization of Michael's dialogue as his character’s personality became more defined. However, a gradual increase in entropy is observed from Season 6 (2.68) to Season 7 (2.72), coinciding with his character's final season, possibly reflecting renewed complexity or nuance in his dialogue as his storyline approached its conclusion.

In [52]:
#saving the image
#fig.write_image('graphs/entropy_michael_word_entropy.png', engine="kaleido")
In [53]:
dwight_word_entropy_season_one = dwight[dwight['season'] == 1]['word_entropy'].mean()
dwight_word_entropy_season_two = dwight[dwight['season'] == 2]['word_entropy'].mean()
dwight_word_entropy_season_three = dwight[dwight['season'] == 3]['word_entropy'].mean()
dwight_word_entropy_season_four = dwight[dwight['season'] == 4]['word_entropy'].mean()
dwight_word_entropy_season_five = dwight[dwight['season'] == 5]['word_entropy'].mean()
dwight_word_entropy_season_six = dwight[dwight['season'] == 6]['word_entropy'].mean()
dwight_word_entropy_season_seven = dwight[dwight['season'] == 7]['word_entropy'].mean()
dwight_word_entropy_season_eight = dwight[dwight['season'] == 8]['word_entropy'].mean()
dwight_word_entropy_season_nine = dwight[dwight['season'] == 9]['word_entropy'].mean()

#creating a dataframe

dwight_word_entropies = pd.DataFrame({'Season': ['Season 1', 'Season 2', 'Season 3', 'Season 4', 'Season 5', 'Season 6', 'Season 7', 'Season 8', 'Season 9'],
                                'Word Entropy': [dwight_word_entropy_season_one, dwight_word_entropy_season_two, dwight_word_entropy_season_three, dwight_word_entropy_season_four, dwight_word_entropy_season_five, dwight_word_entropy_season_six, dwight_word_entropy_season_seven, dwight_word_entropy_season_eight, dwight_word_entropy_season_nine]})

fig = px.bar(dwight_word_entropies,
                x='Season',
                y='Word Entropy',
                title='Dwight\'s Word Entropy by Season',
                color = 'Word Entropy',
                text_auto = True,
                range_color=[2, 3.1]
    )

fig.update_layout(yaxis_title='Word Entropy', xaxis_title='Season')

fig.show()
                                      

Entropy begins at 2.55 in Season 1, then decreases slightly across Seasons 2 to 5, with a low of 2.42 in Season 2. This decline indicates a reduction in linguistic variety, possibly as Dwight's dialogue became more defined and formulaic to emphasize his eccentric personality traits. From Season 6 onward, entropy shows a consistent upward trend, peaking at 2.83 in Season 9. This increase in the later seasons suggests a broadening of Dwight’s dialogue, reflecting his evolving role in the narrative and more complex character arcs as the series progressed.

In [54]:
#saving the image
#fig.write_image('graphs/entropy_dwight_word_entropy.png', engine="kaleido")
In [55]:
jim_word_entropy_season_one = jim[jim['season'] == 1]['word_entropy'].mean()
jim_word_entropy_season_two = jim[jim['season'] == 2]['word_entropy'].mean()
jim_word_entropy_season_three = jim[jim['season'] == 3]['word_entropy'].mean()
jim_word_entropy_season_four = jim[jim['season'] == 4]['word_entropy'].mean()
jim_word_entropy_season_five = jim[jim['season'] == 5]['word_entropy'].mean()
jim_word_entropy_season_six = jim[jim['season'] == 6]['word_entropy'].mean()
jim_word_entropy_season_seven = jim[jim['season'] == 7]['word_entropy'].mean()
jim_word_entropy_season_eight = jim[jim['season'] == 8]['word_entropy'].mean()
jim_word_entropy_season_nine = jim[jim['season'] == 9]['word_entropy'].mean()

#creating a dataframe

jim_word_entropies = pd.DataFrame({'Season': ['Season 1', 'Season 2', 'Season 3', 'Season 4', 'Season 5', 'Season 6', 'Season 7', 'Season 8', 'Season 9'],
                                'Word Entropy': [jim_word_entropy_season_one, jim_word_entropy_season_two, jim_word_entropy_season_three, jim_word_entropy_season_four, jim_word_entropy_season_five, jim_word_entropy_season_six, jim_word_entropy_season_seven, jim_word_entropy_season_eight, jim_word_entropy_season_nine]})

fig = px.bar(jim_word_entropies,
                x='Season',
                y='Word Entropy',
                title='Jim\'s Word Entropy by Season',
                color = 'Word Entropy',
                text_auto = True,
                range_color=[2, 3.1]
    )

fig.update_layout(yaxis_title='Word Entropy', xaxis_title='Season')

fig.show()

In Season 1, entropy starts relatively high at 2.49, indicating a diverse range of words in his dialogue. However, entropy drops noticeably in Season 3 to 2.11, suggesting a simplification or greater consistency in his word usage during this period, possibly reflecting his role as a steadying influence in the narrative. Afterward, entropy stabilizes in Seasons 4 through 8, fluctuating between 2.29 and 2.38, before a notable increase in Season 9 to 2.63. This rise in the final season suggests renewed complexity or variety in his dialogue, potentially reflecting expanded narrative arcs or evolving interactions as the series approached its conclusion.

In [56]:
#saving the image
#fig.write_image('graphs/entropy_jim_word_entropy.png', engine="kaleido")
In [57]:
pam_word_entropy_season_one = pam[pam['season'] == 1]['word_entropy'].mean()
pam_word_entropy_season_two = pam[pam['season'] == 2]['word_entropy'].mean()
pam_word_entropy_season_three = pam[pam['season'] == 3]['word_entropy'].mean()
pam_word_entropy_season_four = pam[pam['season'] == 4]['word_entropy'].mean()
pam_word_entropy_season_five = pam[pam['season'] == 5]['word_entropy'].mean()
pam_word_entropy_season_six = pam[pam['season'] == 6]['word_entropy'].mean()
pam_word_entropy_season_seven = pam[pam['season'] == 7]['word_entropy'].mean()
pam_word_entropy_season_eight = pam[pam['season'] == 8]['word_entropy'].mean()
pam_word_entropy_season_nine = pam[pam['season'] == 9]['word_entropy'].mean()

#creating a dataframe

pam_word_entropies = pd.DataFrame({'Season': ['Season 1', 'Season 2', 'Season 3', 'Season 4', 'Season 5', 'Season 6', 'Season 7', 'Season 8', 'Season 9'],
                                'Word Entropy': [pam_word_entropy_season_one, pam_word_entropy_season_two, pam_word_entropy_season_three, pam_word_entropy_season_four, pam_word_entropy_season_five, pam_word_entropy_season_six, pam_word_entropy_season_seven, pam_word_entropy_season_eight, pam_word_entropy_season_nine]})

fig = px.bar(pam_word_entropies,
                x='Season',
                y='Word Entropy',
                title='Pam\'s Word Entropy by Season',
                color = 'Word Entropy',
                text_auto = True,
                range_color=[2, 3.1]
    )

fig.update_layout(yaxis_title='Word Entropy', xaxis_title='Season')

fig.show()

In Season 1, entropy is relatively low at 2.02, indicating a limited range of word usage early on. From Season 2 onward, entropy increases steadily, reaching its peak in Season 7 at 2.52, which suggests that her dialogue became more varied and complex as her character evolved and gained greater narrative significance. In Seasons 8 and 9, entropy stabilizes at slightly lower levels (2.40 and 2.49), maintaining high diversity but reflecting more consistency as her character’s role was well-established. This trend highlights Pam's progression from a supporting role to a more dynamic and central figure in the series.

In [58]:
#saving the image
#fig.write_image('graphs/entropy_pam_word_entropy.png', engine="kaleido")

I hope you enjoyed this analysis!
